Descriptive Analytics#
Importing packages#
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.tree import DecisionTreeClassifier
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.formula.api import logit
from scipy import stats
import psycopg2
Loading the data#
We begin the exploratory analysis by loading the data:
conn = psycopg2.connect(
dbname="neondb",
user="neondb_owner",
password="YexCUXob4Oy7",
host="ep-little-feather-a5rorog4.us-east-2.aws.neon.tech",
port="5432"
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM companies")
# Fetch the rows and store them in a DataFrame
data2 = cursor.fetchall()
data = pd.DataFrame(data2)
data.columns=['company', 'sector', 'horizon_days', 'amount', 'date_BUY_fix',
'date_SELL_fix', 'price_BUY', 'price_SELL', 'Volatility_Buy',
'Volatility_sell', 'Sharpe_Ratio', 'expected_return_yearly',
'inflation', 'nominal_return', 'investment', 'ESG_ranking', 'PE_ratio',
'EPS_ratio', 'PS_ratio', 'PB_ratio', 'NetProfitMargin_ratio',
'current_ratio', 'roa_ratio', 'roe_ratio']
Let's take a first look at the data:
data.head()
| company | sector | horizon_days | amount | date_BUY_fix | date_SELL_fix | price_BUY | price_SELL | Volatility_Buy | Volatility_sell | ... | investment | ESG_ranking | PE_ratio | EPS_ratio | PS_ratio | PB_ratio | NetProfitMargin_ratio | current_ratio | roa_ratio | roe_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HMC | AUTO | 7 | 500.0 | 2016-06-14 | 2016-06-21 | 25.600000 | 26.410000 | 0.274761 | 0.274900 | ... | GOOD | 28.5 | 13.50 | 1.64 | 0.32 | 0.00 | 2.40 | 1.11 | 1.93 | 4.87 |
| 1 | MSFT | TECH | 330 | 1500.0 | 2016-07-22 | 2017-06-16 | 52.356224 | 66.406387 | 0.279043 | 0.160044 | ... | GOOD | 14.7 | 23.18 | 2.05 | 4.19 | 5.15 | 22.53 | 2.35 | 11.28 | 27.29 |
| 2 | BAC | BANK | 15 | 25000.0 | 2015-01-28 | 2015-02-12 | 13.609470 | 14.925651 | 0.228531 | 0.232711 | ... | GOOD | 26.3 | 42.16 | 0.38 | 1.77 | 0.69 | 4.62 | 0.94 | 0.21 | 1.88 |
| 3 | KSS | RETAIL | 270 | 50000.0 | 2015-01-28 | 2015-10-26 | 47.355690 | 36.093388 | 0.221473 | 0.292144 | ... | BAD | 12.9 | 10.87 | 4.26 | 0.50 | 1.55 | 4.56 | 1.95 | 6.03 | 14.78 |
| 4 | AAPL | TECH | 600 | 10000.0 | 2015-05-21 | 2017-01-10 | 30.166021 | 28.201811 | 0.217025 | 0.227415 | ... | BAD | 16.5 | 14.06 | 2.02 | 3.19 | 5.08 | 22.53 | 1.16 | 19.56 | 39.44 |
5 rows × 24 columns
The describe() and info() functions give a preliminary view of the kinds of data the DataFrame holds.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company 900 non-null object
1 sector 900 non-null object
2 horizon_days 900 non-null int64
3 amount 900 non-null float64
4 date_BUY_fix 900 non-null object
5 date_SELL_fix 900 non-null object
6 price_BUY 900 non-null float64
7 price_SELL 900 non-null float64
8 Volatility_Buy 900 non-null float64
9 Volatility_sell 900 non-null float64
10 Sharpe_Ratio 900 non-null float64
11 expected_return_yearly 900 non-null float64
12 inflation 900 non-null float64
13 nominal_return 900 non-null float64
14 investment 900 non-null object
15 ESG_ranking 900 non-null float64
16 PE_ratio 900 non-null float64
17 EPS_ratio 900 non-null float64
18 PS_ratio 900 non-null float64
19 PB_ratio 900 non-null float64
20 NetProfitMargin_ratio 900 non-null float64
21 current_ratio 900 non-null float64
22 roa_ratio 900 non-null float64
23 roe_ratio 900 non-null float64
dtypes: float64(18), int64(1), object(5)
memory usage: 168.9+ KB
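Only info() is actually run above; for completeness, here is a quick sketch of what describe() would add (count, mean, standard deviation, and quartiles per numeric column), shown on a tiny stand-in frame rather than the full dataset:

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this would simply be data.describe()
demo = pd.DataFrame({"horizon_days": [7, 330, 15],
                     "amount": [500.0, 1500.0, 25000.0]})
resumen = demo.describe()
print(resumen.loc["mean", "amount"])  # 9000.0
```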
Data manipulation#
The variables “company”, “date_BUY_fix” and “date_SELL_fix” describe each individual record rather than any general pattern, so they add nothing to the predictive process. We therefore drop them.
clean = ["company", "date_BUY_fix", "date_SELL_fix"]
data_1 = data.drop(clean, axis =1)
data_1.head()
| sector | horizon_days | amount | price_BUY | price_SELL | Volatility_Buy | Volatility_sell | Sharpe_Ratio | expected_return_yearly | inflation | ... | investment | ESG_ranking | PE_ratio | EPS_ratio | PS_ratio | PB_ratio | NetProfitMargin_ratio | current_ratio | roa_ratio | roe_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AUTO | 7 | 500.0 | 25.600000 | 26.410000 | 0.274761 | 0.274900 | 0.274761 | -0.000176 | -0.2 | ... | GOOD | 28.5 | 13.50 | 1.64 | 0.32 | 0.00 | 2.40 | 1.11 | 1.93 | 4.87 |
| 1 | TECH | 330 | 1500.0 | 52.356224 | 66.406387 | 0.279043 | 0.160044 | 0.279043 | 0.197311 | -0.2 | ... | GOOD | 14.7 | 23.18 | 2.05 | 4.19 | 5.15 | 22.53 | 2.35 | 11.28 | 27.29 |
| 2 | BANK | 15 | 25000.0 | 13.609470 | 14.925651 | 0.228531 | 0.232711 | 0.228531 | 0.007771 | -0.5 | ... | GOOD | 26.3 | 42.16 | 0.38 | 1.77 | 0.69 | 4.62 | 0.94 | 0.21 | 1.88 |
| 3 | RETAIL | 270 | 50000.0 | 47.355690 | 36.093388 | 0.221473 | 0.292144 | 0.221473 | 0.005071 | -0.5 | ... | BAD | 12.9 | 10.87 | 4.26 | 0.50 | 1.55 | 4.56 | 1.95 | 6.03 | 14.78 |
| 4 | TECH | 600 | 10000.0 | 30.166021 | 28.201811 | 0.217025 | 0.227415 | 0.217025 | 0.403993 | -0.5 | ... | BAD | 16.5 | 14.06 | 2.02 | 3.19 | 5.08 | 22.53 | 1.16 | 19.56 | 39.44 |
5 rows × 21 columns
Pie Plot#
conteo_categorias = data_1['investment'].value_counts()
# Build a DataFrame from the category counts (labels and counts need
# distinct keys, otherwise one silently overwrites the other)
df_conteo = pd.DataFrame({'investment': conteo_categorias.index,
                          'count': conteo_categorias.values})
# Build the pie chart with Plotly
fig = px.pie(df_conteo, values='count', names='investment',
             title='Distribution of the investment variable')
# Show the chart
fig.show()
Histogram#
fig = px.histogram(data_1, x='investment', color='investment',
labels={'investment': 'investment', 'count': 'Frequency'},
title='Distribution of the investment variable')
# Update the chart layout
fig.update_layout(xaxis_title='Category',
yaxis_title='Frequency')
# Show the chart
fig.show()
# Split rows by investment category
df_desl_1 = data_1[data_1['investment'] == 'GOOD']
df_desl_0 = data_1[data_1['investment'] == 'BAD']
# Histogram for investment = GOOD
histogram_1 = go.Histogram(x=df_desl_1["inflation"], opacity=0.5, name='investment = GOOD')
# Histogram for investment = BAD
histogram_0 = go.Histogram(x=df_desl_0["inflation"], opacity=0.5, name='investment = BAD')
# Configure the chart layout
layout = go.Layout(title='Histogram of inflation',
xaxis=dict(title='inflation'),
yaxis=dict(title='Frequency'),
barmode='overlay')
# Build the figure with both histograms
fig = go.Figure(data=[histogram_1, histogram_0], layout=layout)
# Show the chart
fig.show()
# Split rows by investment category
df_desl_1 = data_1[data_1['investment'] == 'GOOD']
df_desl_0 = data_1[data_1['investment'] == 'BAD']
# Histogram for investment = GOOD
histogram_1 = go.Histogram(x=df_desl_1["nominal_return"], opacity=0.5, name='investment = GOOD')
# Histogram for investment = BAD
histogram_0 = go.Histogram(x=df_desl_0["nominal_return"], opacity=0.5, name='investment = BAD')
# Configure the chart layout
layout = go.Layout(title='Histogram of nominal_return',
xaxis=dict(title='nominal_return'),
yaxis=dict(title='Frequency'),
barmode='overlay')
# Build the figure with both histograms
fig = go.Figure(data=[histogram_1, histogram_0], layout=layout)
# Show the chart
fig.show()
# Split rows by investment category
df_desl_1 = data_1[data_1['investment'] == 'GOOD']
df_desl_0 = data_1[data_1['investment'] == 'BAD']
# Histogram for investment = GOOD
histogram_1 = go.Histogram(x=df_desl_1["amount"], opacity=0.5, name='investment = GOOD')
# Histogram for investment = BAD
histogram_0 = go.Histogram(x=df_desl_0["amount"], opacity=0.5, name='investment = BAD')
# Configure the chart layout
layout = go.Layout(title='Histogram of amount',
xaxis=dict(title='amount'),
yaxis=dict(title='Frequency'),
barmode='overlay')
# Build the figure with both histograms
fig = go.Figure(data=[histogram_1, histogram_0], layout=layout)
# Show the chart
fig.show()
# Split rows by investment category
df_desl_1 = data_1[data_1['investment'] == 'GOOD']
df_desl_0 = data_1[data_1['investment'] == 'BAD']
# Histogram for investment = GOOD
histogram_1 = go.Histogram(x=df_desl_1["Sharpe_Ratio"], opacity=0.5, name='investment = GOOD')
# Histogram for investment = BAD
histogram_0 = go.Histogram(x=df_desl_0["Sharpe_Ratio"], opacity=0.5, name='investment = BAD')
# Configure the chart layout
layout = go.Layout(title='Histogram of Sharpe_Ratio',
xaxis=dict(title='Sharpe_Ratio'),
yaxis=dict(title='Frequency'),
barmode='overlay')
# Build the figure with both histograms
fig = go.Figure(data=[histogram_1, histogram_0], layout=layout)
# Show the chart
fig.show()
# Split rows by investment category
df_desl_1 = data_1[data_1['investment'] == 'GOOD']
df_desl_0 = data_1[data_1['investment'] == 'BAD']
# Histogram for investment = GOOD
histogram_1 = go.Histogram(x=df_desl_1["roa_ratio"], opacity=0.5, name='investment = GOOD')
# Histogram for investment = BAD
histogram_0 = go.Histogram(x=df_desl_0["roa_ratio"], opacity=0.5, name='investment = BAD')
# Configure the chart layout
layout = go.Layout(title='Histogram of roa_ratio',
xaxis=dict(title='roa_ratio'),
yaxis=dict(title='Frequency'),
barmode='overlay')
# Build the figure with both histograms
fig = go.Figure(data=[histogram_1, histogram_0], layout=layout)
# Show the chart
fig.show()
These plots show how each variable is distributed and how that distribution relates to the investment category.
Scatter plots#
# Build the scatter plot with Plotly
fig = px.scatter(data_1, x='inflation', y='nominal_return', color="investment",
title='Scatter plot',
labels={'inflation': 'inflation', 'nominal_return': 'nominal_return'})
# Show the chart
fig.show()
# Build the scatter plot with Plotly
fig = px.scatter(data_1, x='inflation', y='amount', color="investment",
title='Scatter plot',
labels={'inflation': 'inflation', 'amount': 'amount'})
# Show the chart
fig.show()
# Build the scatter plot with Plotly
fig = px.scatter(data_1, x='inflation', y='Sharpe_Ratio', color="investment",
title='Scatter plot',
labels={'inflation': 'inflation', 'Sharpe_Ratio': 'Sharpe_Ratio'})
# Show the chart
fig.show()
# Build the scatter plot with Plotly
fig = px.scatter(data_1, x='inflation', y='expected_return_yearly', color="investment",
title='Scatter plot',
labels={'inflation': 'inflation', 'expected_return_yearly': 'expected_return_yearly'})
# Show the chart
fig.show()
fig = px.scatter_3d(data_1, x='inflation', y='nominal_return', z='expected_return_yearly', color='investment',
title='3D scatter plot',
labels={'inflation': 'inflation', 'nominal_return': 'nominal_return', 'expected_return_yearly': 'expected_return_yearly', 'investment': 'investment'})
# Show the chart
fig.show()
Comparison with the response variable#
We can see how each variable behaves depending on its investment category; “inflation” and “nominal_return” show the most distinctive behavior.
# Box plots of each numeric variable, split by investment category
for column in data_1.select_dtypes(include='number').columns:
    fig = px.box(data_1, x="investment", y=column, color="investment")
    fig.show()
As the classification tree suggested, the variables with the greatest influence are inflation and nominal_return.
sector, horizon_days, amount, price_SELL, price_BUY, Volatility_Buy, Volatility_sell, Sharpe_Ratio, expected_return_yearly, ESG_ranking, PE_ratio, EPS_ratio, PS_ratio, NetProfitMargin_ratio, current_ratio, roa_ratio and roe_ratio show no apparent strong tendency toward either category. That is, visually they do not seem decisive in classifying an investment as GOOD or BAD.
fig = px.histogram(data, x='sector', color='investment')
fig.show()
No sector shows a predominant tendency toward good or bad investments. Since it is the only categorical predictor, we decide not to include it when training the model.
data_1= data_1.drop("sector", axis=1)
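Dropping sector is the simplest choice; an alternative worth noting (not used here) is to one-hot encode it with pd.get_dummies, which would let a model consume the category as numeric indicator columns:

```python
import pandas as pd

# Hypothetical mini-frame with sector labels as they appear in the dataset
demo = pd.DataFrame({"sector": ["AUTO", "TECH", "BANK"]})
dummies = pd.get_dummies(demo["sector"], prefix="sector")
print(sorted(dummies.columns))  # ['sector_AUTO', 'sector_BANK', 'sector_TECH']
```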
Correlations#
We use correlation to check that the independent variables are not linearly related to one another.
numeric_df = data_1.select_dtypes(include=['int', 'float'])
# Compute the correlation matrix
matriz_correlacion = numeric_df.corr()
matriz_correlacion
| horizon_days | amount | price_BUY | price_SELL | Volatility_Buy | Volatility_sell | Sharpe_Ratio | expected_return_yearly | inflation | nominal_return | ESG_ranking | PE_ratio | EPS_ratio | PS_ratio | PB_ratio | NetProfitMargin_ratio | current_ratio | roa_ratio | roe_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| horizon_days | 1.000000 | 0.057634 | -0.019103 | 0.041087 | -0.000782 | 0.079032 | -0.000782 | 0.557271 | 0.000982 | 0.314202 | -0.056352 | -0.030855 | 0.043956 | 0.058744 | 0.000010 | 0.049919 | 0.063964 | 0.059054 | 0.045002 |
| amount | 0.057634 | 1.000000 | 0.015068 | 0.019191 | 0.036155 | -0.017119 | 0.036155 | 0.022659 | 0.024025 | 0.035272 | 0.041535 | -0.025461 | -0.005007 | 0.003251 | 0.026403 | -0.072575 | -0.033395 | -0.089766 | -0.098832 |
| price_BUY | -0.019103 | 0.015068 | 1.000000 | 0.975333 | -0.074075 | -0.054223 | -0.074075 | 0.105182 | 0.091847 | 0.046071 | 0.060159 | 0.317998 | 0.679935 | 0.219801 | 0.367833 | 0.112021 | 0.161695 | 0.158972 | 0.004417 |
| price_SELL | 0.041087 | 0.019191 | 0.975333 | 1.000000 | -0.071000 | -0.062839 | -0.071000 | 0.165045 | 0.083859 | 0.142205 | 0.058317 | 0.304729 | 0.698000 | 0.222112 | 0.350046 | 0.122658 | 0.172211 | 0.161950 | 0.002011 |
| Volatility_Buy | -0.000782 | 0.036155 | -0.074075 | -0.071000 | 1.000000 | 0.850632 | 1.000000 | 0.166545 | 0.050158 | 0.065940 | -0.079210 | 0.033027 | -0.225247 | 0.237593 | 0.230903 | -0.156604 | 0.194057 | -0.054318 | -0.265809 |
| Volatility_sell | 0.079032 | -0.017119 | -0.054223 | -0.062839 | 0.850632 | 1.000000 | 0.850632 | 0.171310 | 0.070344 | -0.039453 | -0.123275 | 0.035551 | -0.211890 | 0.187969 | 0.176131 | -0.175065 | 0.175390 | -0.040506 | -0.258205 |
| Sharpe_Ratio | -0.000782 | 0.036155 | -0.074075 | -0.071000 | 1.000000 | 0.850632 | 1.000000 | 0.166545 | 0.050158 | 0.065940 | -0.079210 | 0.033027 | -0.225247 | 0.237593 | 0.230903 | -0.156604 | 0.194057 | -0.054318 | -0.265809 |
| expected_return_yearly | 0.557271 | 0.022659 | 0.105182 | 0.165045 | 0.166545 | 0.171310 | 0.166545 | 1.000000 | 0.022885 | 0.533747 | -0.088662 | 0.085136 | 0.038070 | 0.391668 | 0.361000 | 0.199539 | 0.263973 | 0.243156 | -0.021053 |
| inflation | 0.000982 | 0.024025 | 0.091847 | 0.083859 | 0.050158 | 0.070344 | 0.050158 | 0.022885 | 1.000000 | 0.003025 | -0.004853 | -0.007222 | -0.008672 | 0.048510 | 0.116725 | -0.034308 | -0.009438 | 0.041586 | 0.038014 |
| nominal_return | 0.314202 | 0.035272 | 0.046071 | 0.142205 | 0.065940 | -0.039453 | 0.065940 | 0.533747 | 0.003025 | 1.000000 | -0.105205 | 0.025856 | 0.072892 | 0.118291 | 0.038219 | 0.152078 | 0.101896 | 0.134162 | 0.049182 |
| ESG_ranking | -0.056352 | 0.041535 | 0.060159 | 0.058317 | -0.079210 | -0.123275 | -0.079210 | -0.088662 | -0.004853 | -0.105205 | 1.000000 | 0.112499 | -0.106382 | 0.164561 | 0.012402 | -0.138176 | 0.118752 | -0.429422 | -0.469440 |
| PE_ratio | -0.030855 | -0.025461 | 0.317998 | 0.304729 | 0.033027 | 0.035551 | 0.033027 | 0.085136 | -0.007222 | 0.025856 | 0.112499 | 1.000000 | -0.042664 | 0.117468 | 0.352464 | -0.036344 | 0.054029 | -0.029921 | -0.064672 |
| EPS_ratio | 0.043956 | -0.005007 | 0.679935 | 0.698000 | -0.225247 | -0.211890 | -0.225247 | 0.038070 | -0.008672 | 0.072892 | -0.106382 | -0.042664 | 1.000000 | 0.078800 | -0.045096 | 0.329119 | 0.130749 | 0.311497 | 0.243385 |
| PS_ratio | 0.058744 | 0.003251 | 0.219801 | 0.222112 | 0.237593 | 0.187969 | 0.237593 | 0.391668 | 0.048510 | 0.118291 | 0.164561 | 0.117468 | 0.078800 | 1.000000 | 0.483868 | 0.522420 | 0.822391 | 0.446526 | -0.029877 |
| PB_ratio | 0.000010 | 0.026403 | 0.367833 | 0.350046 | 0.230903 | 0.176131 | 0.230903 | 0.361000 | 0.116725 | 0.038219 | 0.012402 | 0.352464 | -0.045096 | 0.483868 | 1.000000 | -0.043134 | 0.182139 | 0.117530 | -0.176910 |
| NetProfitMargin_ratio | 0.049919 | -0.072575 | 0.112021 | 0.122658 | -0.156604 | -0.175065 | -0.156604 | 0.199539 | -0.034308 | 0.152078 | -0.138176 | -0.036344 | 0.329119 | 0.522420 | -0.043134 | 1.000000 | 0.474356 | 0.733180 | 0.560262 |
| current_ratio | 0.063964 | -0.033395 | 0.161695 | 0.172211 | 0.194057 | 0.175390 | 0.194057 | 0.263973 | -0.009438 | 0.101896 | 0.118752 | 0.054029 | 0.130749 | 0.822391 | 0.182139 | 0.474356 | 1.000000 | 0.452459 | 0.111891 |
| roa_ratio | 0.059054 | -0.089766 | 0.158972 | 0.161950 | -0.054318 | -0.040506 | -0.054318 | 0.243156 | 0.041586 | 0.134162 | -0.429422 | -0.029921 | 0.311497 | 0.446526 | 0.117530 | 0.733180 | 0.452459 | 1.000000 | 0.710006 |
| roe_ratio | 0.045002 | -0.098832 | 0.004417 | 0.002011 | -0.265809 | -0.258205 | -0.265809 | -0.021053 | 0.038014 | 0.049182 | -0.469440 | -0.064672 | 0.243385 | -0.029877 | -0.176910 | 0.560262 | 0.111891 | 0.710006 | 1.000000 |
heatmap = go.Heatmap(
z=matriz_correlacion.values, # correlation values
x=matriz_correlacion.columns, # column names
y=matriz_correlacion.columns, # row names
colorscale='Viridis', # color palette
)
layout = go.Layout(title='Correlation heatmap')
fig = go.Figure(data=[heatmap], layout=layout)
# Show the heatmap
fig.show()
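The matrix shows at least one perfectly collinear pair (Sharpe_Ratio and Volatility_Buy have correlation 1.0, so one of the two is redundant), with price_BUY/price_SELL close behind at 0.975. Such pairs can be listed programmatically; a sketch:

```python
import pandas as pd


def high_corr_pairs(corr: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, r) for upper-triangle pairs with |r| >= threshold."""
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], r))
    return pairs

# Usage: high_corr_pairs(matriz_correlacion, 0.9)
```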
Exploration: logistic regression model#
As seen above, most variables do not appear to help explain the response variable. As an exploratory exercise, we therefore fit a logistic regression model to get to know the data a little better.
# sm.Logit requires a numeric response, so encode GOOD/BAD as 1/0 first
data_1['investment'] = data_1['investment'].map({'GOOD': 1, 'BAD': 0})
modelo_logit = sm.Logit(data_1['investment'], data_1[['Volatility_sell', 'expected_return_yearly', 'inflation', 'nominal_return', 'price_SELL', 'price_BUY']])
resultado_modelo = modelo_logit.fit()
p_valores = resultado_modelo.pvalues
variables_significativas = p_valores[p_valores < 0.05]
variables_significativas
Volatility_sell 2.623637e-13
expected_return_yearly 1.624640e-02
inflation 1.012144e-22
nominal_return 8.670997e-23
price_SELL 1.149075e-04
price_BUY 1.378763e-04
dtype: float64
resultado_modelo.summary()
| Dep. Variable: | investment | No. Observations: | 900 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 894 |
| Method: | MLE | Df Model: | 5 |
| Date: | Sun, 21 Apr 2024 | Pseudo R-squ.: | 0.6825 |
| Time: | 16:12:04 | Log-Likelihood: | -190.05 |
| converged: | True | LL-Null: | -598.63 |
| Covariance Type: | nonrobust | LLR p-value: | 2.246e-174 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Volatility_sell | -6.5713 | 0.899 | -7.312 | 0.000 | -8.333 | -4.810 |
| expected_return_yearly | -4.4793 | 1.864 | -2.403 | 0.016 | -8.132 | -0.826 |
| inflation | -7.1526 | 0.729 | -9.811 | 0.000 | -8.581 | -5.724 |
| nominal_return | 24.9465 | 2.539 | 9.826 | 0.000 | 19.971 | 29.922 |
| price_SELL | -0.0258 | 0.007 | -3.857 | 0.000 | -0.039 | -0.013 |
| price_BUY | 0.0279 | 0.007 | 3.812 | 0.000 | 0.014 | 0.042 |
Possibly complete quasi-separation: A fraction 0.38 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
This exploratory model again confirms that “inflation” and “nominal_return” contribute the most explanatory power, followed by Volatility_sell and expected_return_yearly.
To conclude the exploratory analysis: most of the variables in the DataFrame show no strong association with the categorical response. As the box plots and the classification tree used to study the variables made clear, only a few variables appear decisive in telling whether an investment will be good or bad. Additionally, an exploratory logistic regression model, fit for didactic purposes, showed that only a handful of variables carry real predictive weight.